This document details visualization in anvio.
Anvio is a platform for visualizing metagenomic data and building pangenomes. It runs in a dedicated conda environment.
conda activate anvio-7.1
Convert the bin information into a format that anvio can use. This means concatenating the bin files for each binning method into a single list of which contig or read belongs to which bin.
library(stringr)  # provides str_replace
# get all bin directories; path[1] is the parent directory itself
path <- list.dirs("../data/Bins")
# loop over the binning-method directories (path[2] to path[7])
for (i in 2:7){
  DF <- NULL
  pathname <- path[i]
  filelist <- list.files(paste0(pathname, "/"))
  # collect all contigs and reads for all bins into one tsv file
  for (filename in filelist){
    df <- read.csv(paste0(pathname, "/", filename), header = FALSE)
    df <- as.data.frame(df)
    colnames(df) <- "read"
    # build the bin name by replacing the first "." in the file name with "_"
    df$bin <- str_replace(filename, "[.]", "_")
    # change names if a number is at the beginning of the bin name
    if (basename(pathname) == "24_sample_bam_bins"){
      df$bin <- str_replace(df$bin, "24", "twentyfour")
    }
    if (basename(pathname) == "47_sample_bam_bins"){
      df$bin <- str_replace(df$bin, "47", "fortyseven")
    }
    DF <- rbind(DF, df)
  }
  # write without a header row or quotes, one tab-separated file per method
  write.table(DF, paste0("../output/all_bins/", basename(pathname), ".tsv"), row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t")
}
Examine output tsv files.
# the file was written without a header row, so read it with header = FALSE
tsv_output <- read.csv("../output/all_bins/assembly_bins.tsv", sep = "\t", header = FALSE, col.names = c("read", "bin"))
kable(head(tsv_output, 5))
| read | bin |
|---|---|
| MG1058_s821.ctg000852l | assembly_bin_1 |
| MG1058_s1105.ctg001148l | assembly_bin_1 |
| MG1058_s1585.ctg001645l | assembly_bin_1 |
| MG1058_s1820.ctg001893l | assembly_bin_1 |
| MG1058_s645.ctg000674l | assembly_bin_10 |
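Before importing, it can help to check each tsv file from the command line. The following is a minimal sketch, not part of the original workflow: the function name `check_bins_tsv` is hypothetical, and the naming rules it enforces (no leading digit, only letters, digits, and underscores in bin names) are taken from the renaming done in the R code above rather than from anvio's own validation.

```shell
# check_bins_tsv FILE
# Report lines that do not look like a clean two-column contig-to-bin table.
# Assumption: bin names should contain only letters, digits, and "_" and
# should not start with a digit (the rules this document applies by hand).
check_bins_tsv() {
  awk -F '\t' '
    NF != 2 {
      printf "line %d: expected 2 columns, found %d\n", NR, NF
      bad = 1
      next
    }
    $2 ~ /^[0-9]/ {
      printf "line %d: bin name starts with a digit: %s\n", NR, $2
      bad = 1
    }
    $2 !~ /^[A-Za-z0-9_]+$/ {
      printf "line %d: bin name has unexpected characters: %s\n", NR, $2
      bad = 1
    }
    END { exit bad + 0 }
  ' "$1"
}

# Example call (path taken from this document):
# check_bins_tsv ../output/all_bins/assembly_bins.tsv
```

The function prints one message per problem and returns a non-zero exit status if any line fails, so it can gate the import step in a script.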
Import the bin information into the existing anvio profile database.
# Example for one collection; change the input file and the -C collection name for each binning method
anvi-import-collection "./github/jordan-marinimicrobia/output/all_bins/short_reads_bam_bins.tsv" -p "./Downloads/plus_PROFILE.db" -c "./Library/CloudStorage/GoogleDrive-jwinter2@uw.edu/Shared drives/Rocap Lab/Project_ODZ_Marinimicrobia_Jordan/Anvio/assembly_plus/1058_P1_2018_585_0.2um_assembly_plus.db" --contigs-mode -C shortreads
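Since the import has to be repeated for every binning method, a loop can generate the commands. This is a sketch under assumptions: `PROFILE_DB` and `CONTIGS_DB` are placeholders for your own database paths, `collection_name` and `import_all_collections` are hypothetical helper names, and the collection name is simply the file name without `.tsv`, which differs from the hand-picked name `shortreads` used above. The anvio flags are the ones already shown in this document.

```shell
# Placeholder paths (assumptions); point these at your own databases.
PROFILE_DB="${PROFILE_DB:-plus_PROFILE.db}"
CONTIGS_DB="${CONTIGS_DB:-contigs.db}"

# collection_name TSV
# Derive a collection name from the file name, e.g.
# short_reads_bam_bins.tsv -> short_reads_bam_bins (illustrative convention).
collection_name() {
  basename "$1" .tsv
}

# import_all_collections DIR
# Print one anvi-import-collection command per tsv file in DIR.
# Set RUN=1 to execute the commands instead of printing them.
import_all_collections() {
  for tsv in "$1"/*.tsv; do
    [ -e "$tsv" ] || continue   # skip when the glob matches nothing
    name=$(collection_name "$tsv")
    if [ "${RUN:-0}" = "1" ]; then
      anvi-import-collection "$tsv" -p "$PROFILE_DB" -c "$CONTIGS_DB" --contigs-mode -C "$name"
    else
      printf 'anvi-import-collection "%s" -p "%s" -c "%s" --contigs-mode -C %s\n' \
        "$tsv" "$PROFILE_DB" "$CONTIGS_DB" "$name"
    fi
  done
}

# Example call (directory taken from this document):
# import_all_collections ../output/all_bins
```

Printing by default makes it easy to review the commands first. Collection names derived from directories such as `24_sample_bam_bins` begin with a digit and may need renaming, as was done for the bin names.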
Use the interactive browser to visualize the metagenome and bins.
anvi-interactive -p "./Downloads/plus_PROFILE.db" -c "./Library/CloudStorage/GoogleDrive-jwinter2@uw.edu/Shared drives/Rocap Lab/Project_ODZ_Marinimicrobia_Jordan/Anvio/assembly_plus/1058_P1_2018_585_0.2um_assembly_plus.db"
This is an example of what the interactive browser looks like with bins. Anvio calculates statistics like completion and redundancy for each bin.
[Figure: Anvio interactive browser with bins]
Dig into “contaminated” bins to see how and why they are contaminated. Remember that bin names in the anvio collection differ from the original file names: “.” is replaced with “_”, and the numbers 24 and 47 are written out as “twentyfour” and “fortyseven”.
anvi-refine -p "./Downloads/assembly_PROFILE.db" -c "./Library/CloudStorage/GoogleDrive-jwinter2@uw.edu/Shared drives/Rocap Lab/Project_ODZ_Marinimicrobia_Jordan/Anvio/assembly_only/1058_P1_2018_585_0.2um_assembly.db" -C shortreads -b short_reads_bam_bin_163
Example of a contaminated bin. The coverage is inconsistent across contigs, the clustering that anvio uses to group sequences splits into many branches, and many single-copy core genes are duplicated.
[Figure: Anvio interactive display of a contaminated bin]
I used anvi-interactive to get a summary of all bins in each bin collection. The example output file below reports bin size, contig count, completion, redundancy, and taxonomy for each bin.
# named bin_summary so the data frame does not shadow R's summary() function
bin_summary <- read.table("../output/anvio_outputs/assembly_plus_summary.txt", sep = "\t", header = TRUE)
summary(bin_summary)
##      bins            total_length        num_contigs            N50
##  Length:274         Min.   :  202581   Min.   :   1.00    Min.   :  10203
##  Class :character   1st Qu.:  310692   1st Qu.:   9.00    1st Qu.:  12386
##  Mode  :character   Median :  488296   Median :  23.00    Median :  14620
##                     Mean   :  879731   Mean   :  52.00    Mean   :  74294
##                     3rd Qu.:  896170   3rd Qu.:  50.75    3rd Qu.:  50200
##                     Max.   :20079188   Max.   :1582.00    Max.   :3034959
##    GC_content     percent_completion   percent_redundancy     t_domain
##  Min.   :26.47    Min.   :  0.00       Min.   :   0.00      Length:274
##  1st Qu.:38.99    1st Qu.:  0.00       1st Qu.:   0.00      Class :character
##  Median :45.84    Median :  0.00       Median :   0.00      Mode  :character
##  Mean   :47.44    Mean   : 13.16       Mean   :  16.19
##  3rd Qu.:57.02    3rd Qu.: 23.59       3rd Qu.:   0.00
##  Max.   :69.50    Max.   :100.00       Max.   :2053.52
##    t_phylum           t_class            t_order            t_family
##  Length:274         Length:274         Length:274         Length:274
##  Class :character   Class :character   Class :character   Class :character
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character
##
##
##
##    t_genus            t_species
##  Length:274         Length:274
##  Class :character   Class :character
##  Mode  :character   Mode  :character
##
##
##
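To pick out bins worth inspecting with anvi-refine, the summary file can be filtered by redundancy. The sketch below is an illustration, not the method used here: `high_redundancy_bins` is a hypothetical helper, the default cutoff of 10 is an arbitrary value, and the column names `bins`, `percent_completion`, and `percent_redundancy` are assumed to match the header shown in the R output above.

```shell
# high_redundancy_bins FILE [THRESHOLD]
# Print bin name, completion, and redundancy (tab-separated) for every bin
# whose percent_redundancy exceeds THRESHOLD.
# Assumptions: FILE is tab-separated with a header row; the default
# threshold of 10 is an arbitrary illustrative cutoff.
high_redundancy_bins() {
  awk -F '\t' -v max="${2:-10}" '
    NR == 1 {
      # map column names to positions so column order does not matter
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("bins" in col) || !("percent_completion" in col) || !("percent_redundancy" in col)) {
        print "missing expected column in header" > "/dev/stderr"
        exit 2
      }
      next
    }
    $(col["percent_redundancy"]) + 0 > max + 0 {
      printf "%s\t%s\t%s\n", $(col["bins"]), $(col["percent_completion"]), $(col["percent_redundancy"])
    }
  ' "$1"
}

# Example call (path taken from this document):
# high_redundancy_bins ../output/anvio_outputs/assembly_plus_summary.txt 10
```

Looking columns up by name rather than by position keeps the filter working if the summary file gains or reorders columns.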
I will build a pangenome from the assembly plus bins because they are lower in contamination but contain more contigs than the assembly-only bins. In addition to my bins, I added three complete genomes from cultured Sulfitobacter pontiacus to the pangenome.
I created anvio contigs databases (dbs) for my bins and annotated them with COGs, KEGG KOfams, HMMs, and tRNAs. I used the interactive interface to visualize the pangenome and find variable regions of the Sulfitobacter genome. The next markdown document is based on these results.
# Example for one bin; repeat for each bin and cultured genome
anvi-gen-contigs-database -f sulf_genomes/assembly_plus_bin_4.fa -o sulf_genomes/dbs/sulfbin4.db
anvi-run-hmms -c sulf_genomes/dbs/sulfbin4.db
anvi-run-scg-taxonomy -c sulf_genomes/dbs/sulfbin4.db
anvi-scan-trnas -c sulf_genomes/dbs/sulfbin4.db
anvi-run-ncbi-cogs -c sulf_genomes/dbs/sulfbin4.db
anvi-run-kegg-kofams -c sulf_genomes/dbs/sulfbin4.db
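The commands above handle a single bin. A minimal sketch for repeating them over every genome follows. The helper names `run_or_print`, `annotate_genome`, and `annotate_all` are hypothetical, the `.fa` extension and directory layout are assumptions, and the database is named after the FASTA file rather than with the shorter `sulfbin4.db` style used above.

```shell
# run_or_print CMD ARGS...
# Execute the command when RUN=1, otherwise print it for review.
# Printed commands are for inspection only; arguments are not re-quoted.
run_or_print() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# annotate_genome FASTA DB_DIR
# Build a contigs database for one genome and run the same annotation
# programs used in this document.
annotate_genome() {
  fasta=$1
  # assumption: name the database after the FASTA file, minus its extension
  db="$2/$(basename "${fasta%.*}").db"
  run_or_print anvi-gen-contigs-database -f "$fasta" -o "$db"
  for tool in anvi-run-hmms anvi-run-scg-taxonomy anvi-scan-trnas \
              anvi-run-ncbi-cogs anvi-run-kegg-kofams; do
    run_or_print "$tool" -c "$db"
  done
}

# annotate_all FASTA_DIR DB_DIR
# Apply annotate_genome to every .fa file in FASTA_DIR.
annotate_all() {
  for fasta in "$1"/*.fa; do
    [ -e "$fasta" ] || continue   # skip when the glob matches nothing
    annotate_genome "$fasta" "$2"
  done
}

# Example call (directories taken from this document):
# annotate_all sulf_genomes sulf_genomes/dbs
```

With RUN unset, the functions only print the six commands per genome, which makes it easy to confirm paths before starting the long-running annotation steps.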
anvi-gen-genomes-storage -e sulf-external-genomes.txt \
-o sulf-GENOMES.db
anvi-pan-genome -g sulf-GENOMES.db -n sulfitobacter
anvi-display-pan -g sulf-GENOMES.db -p sulfitobacter/sulfitobacter-PAN.db
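The `-e` argument of anvi-gen-genomes-storage expects a tab-delimited external genomes file with the columns `name` and `contigs_db_path`. The sketch below shows one way to generate it from a directory of contigs databases. `write_external_genomes` is a hypothetical helper, and deriving each genome name from the database file name (with characters other than letters, digits, and underscores replaced by `_`) is an assumption, not the naming used for the pangenome in this document.

```shell
# write_external_genomes DB_DIR OUT
# Write a two-column, tab-delimited file listing every .db file in DB_DIR.
# Assumption: genome names come from the file names, with any character
# other than letters, digits, and "_" replaced by "_".
write_external_genomes() {
  printf 'name\tcontigs_db_path\n' > "$2"
  for db in "$1"/*.db; do
    [ -e "$db" ] || continue   # skip when the glob matches nothing
    name=$(basename "$db" .db | tr -c 'A-Za-z0-9_\n' '_')
    printf '%s\t%s\n' "$name" "$db" >> "$2"
  done
}

# Example call (paths taken from this document):
# write_external_genomes sulf_genomes/dbs sulf-external-genomes.txt
```

The names in the first column are what appear as genome labels in the pangenome display, so it is worth editing the file by hand if shorter labels are preferred.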
Pangenome visualization. The variable regions are labeled, and sulf 1, 2, and 3 are the cultured genomes.
[Figure: Sulfitobacter pangenome]